ABSTRACT

This paper describes an analysis of hardware related software errors on
the MVS operating system at the Center for Information Technology (CIT)
at Stanford University. The study first examines the software error
detection mechanisms, with particular reference to the detection of
software errors related to temporary and permanent hardware problems.
About 11 percent of all software errors and over 40 percent of all
software failures were found to be hardware related. It is shown that
the system is seldom able to diagnose the fact that a software error may
be hardware related. Key patterns in the occurrence of hardware related
software errors are determined and their effect on system recovery
examined. In a HW/SW record, both the hardware and the software errors
occur in large clusters and have a significant percentage of lost
records associated with them. The system recovery management is less
likely to recover from hardware related software errors than from
software errors in general. It is suggested that the error patterns
found in this study could form the basis for the detection and recovery
management of hardware related software errors.

Keywords: Software reliability, hardware/software interactions, recovery
1. INTRODUCTION

The design of reliable and fault tolerant software systems is one of the
most important issues facing computer designers today. Software cost
and reliability are the major problem areas affecting modern computer
systems. The question of hardware and software interaction and its
effect on system reliability is particularly difficult to comprehend.
It is further compounded by the lack of availability of real data. It
is our view that results based on actual measurements and experiments
are essential for developing a clear understanding of the problem.

The MVS system on the IBM 3081 at the Center for Information Technology
(CIT) at Stanford University provided an ideal opportunity in this
regard. The operating system automatically collects information on
error detection and correction. The state of the machine at the time of
the error is also recorded. CIT is the main campus computation
facility. It is used for production programs (payrolls and
administration), student and research projects, and for general purpose
computing. The installation consists of two IBM 3081 processors which
run the MVS operating system. The two processors are loosely coupled,
e.g., they have distinct control programs and different I/O
configurations. On a typical day, the two systems support around 500
users and run approximately 4000 batch jobs.

The general objective of this study was to determine the extent and
impact of temporary and permanent hardware errors on the operating
system. The analysis differentiates between the terms "error" and
"failure". A failure is an error which causes the termination of the
system (i.e., a system failure). Thus an error, in general, may or may
not result in a failure. It is generally believed that the operating
system is not always able to diagnose a software error related to a
hardware error or failure. We define this as a hardware related
software error and denote it as a "HW/SW" error. Note that the
relationship may be either cause and effect (i.e., the hardware error
caused the software error) or symptomatic (i.e., both the hardware error
and the software error are symptoms of another, yet unidentified,
problem). A HW/SW error is further subdivided as follows:

1. Software errors found related to temporary hardware errors
(denoted by "HW/SW-Temp.").

2. Software failures found related to permanent hardware failures
(denoted by "HW/SW-Perm.").

We commence by analysing the error detection facilities in MVS with
particular reference to hardware related problems. The most common
types of hardware-related software errors are identified and their
relative frequencies found. Finally, the impact of HW/SW errors on the
system is evaluated by measuring the effectiveness of system recovery in
handling hardware related software errors.

The approach adopted was to start with a substantial quantity of high
quality data on all software errors (recoverable and non-recoverable).
The data on error detection and recovery is automatically logged by the
operating system. An error collection mechanism, which selected and
filtered the raw data so as to cluster records referring to the same
error, was developed (see [Velardi 83] for details). The data set so
obtained was then merged with data sets of temporary and permanent
hardware errors. The data on temporary hardware errors came from
channel and disk error logs [IBM 79]. Data on permanent hardware
problems came from UNILOG [Butner 80], an installation (CIT) maintained
log of failures and repair.

The important results of the study are summarised below:

1. The operating system is seldom able to diagnose the fact that a
software error may be hardware related.

2. About 11 percent of all software errors were determined to be
hardware related.

3. Over 40 percent of all software failures were found to be hardware
related.

4. The key pattern of a HW/SW record is that both the hardware and
the software records occur in large clusters and have a significant
percentage of lost records associated with them.

5. The system recovery management is less effective in handling
hardware related software errors than software errors in general.

Before describing this work in detail, an overview of related research
in this area is presented.

2. RELATED RESEARCH AND MOTIVATION

Designing hardware systems that tolerate faults is relatively well
understood, at least from a theoretical viewpoint. However, the problem
of software fault tolerance has yet to be well investigated [Hecht
80a, b].

The term "software reliability model" is usually taken to mean
mathematical relationships for assessing the reliability of software (in
terms of statistical parameters such as Mean Time Between Failures)
during the development, debugging or testing phases. A few of these
models have also been applied in follow-up operational phases. Several
competing models have appeared in the literature (see [Musa 80] for
details), and a number of authors have attempted to analyse their
suitability. An appreciation of the extent and nature of this
discussion can be obtained from [Goel 80]. The main difficulty with
these approaches is that, although each model appears to be valid within
its own assumptions, there is insufficient experimental evidence
available to judge their general validity.

Research most closely related to the present study is in the area of
analysis of errors and their causes in large software systems. [Endres
75] discusses and categorises errors and error frequencies during the
internal testing phase of the IBM DOS/VS system. In [Thayer 78] data
collected from four large software development projects is analysed.
[Hamilton 78] applies the well known execution time model [Musa 80] to
measure the operational reliability of computer center software, and
[Glass 80] examines the occurrence of persistent bugs and their causes
in operational software. Another useful study is [Maxwell 78], which
tabulates and examines error statistics on software.

None of these studies tries to relate system reliability or the error
frequencies to the usage environment of the software itself in a
systematic manner. Results based on such measurements are essential in
order to evaluate the system fault tolerance and automatic recovery
features.

In an early study of failures at the SLAC (Stanford Linear Accelerator
Center) computation facility, [Butner 80] and [Iyer 82a] found a
strong correlation between the occurrence of failures and the level of
system activity at the time of failure. A more detailed and accurate
analysis of failures on a VM/370 system (in service at SLAC since
February 1981) confirmed this relationship [Rossetti 82]. In addition,
this study found that a significant proportion (16 percent) of
software-related system failures were due to hardware problems. In many
of these cases it was determined that the system should have been
designed to continue operation at least in a degraded mode. To the
authors' knowledge there are no other experimental studies reported in
the literature on hardware-software interaction.

More recently [Velardi 83] analysed the error recovery facilities on
the MVS system. Data on error recovery showed that the system fault
tolerance almost doubles when recovery routines are provided for failing
programs, in comparison with the case where only system provided
recovery management is available. The recovery routines are most
effective in handling storage management problems (an important feature
of MVS). However, even when recovery routines are provided, there is
almost a 50% chance of system failure when critical system jobs are
involved. Thus there is still considerable scope for improvement.
Deadlocks, I/O and data management, and exceptions are the main problem
areas.

Finally, a preliminary examination of the data appeared to indicate
that the error detection in MVS is not always able to diagnose software
problems resulting from a hardware failure. It was clear that further
analysis was necessary to fully understand this problem.
3. THE DATA BASE

The automatic detection of a software error in MVS can be through
hardware or software facilities. Hardware detects conditions such as
overflows, addressing or divide exceptions, and is generally used to
protect storage or other system resources from unauthorised access.
Hardware detection manifests itself as a program interruption (program
check). Software detects more complex conditions such as an incorrect
parameter specification in a macro or the invalid use of control
statements. Data on the type of detection (hardware or software) and
recovery is logged by the system on to a data set called SYS1.LOGREC. A
description of error detection and recovery processing in MVS appears in
Appendix A and in [IBM 79].

Initially, the SYS1.LOGREC data set (which is in hexadecimal code)
was compacted in order to extract the relevant information and to
provide explanations for hexadecimal codes. Then, the records believed
to be repeated occurrences of the same problem were clustered. The
number of observations in a cluster (SWPOINTS, HWPOINTS) and the time
span of the cluster (SWSPAN, HWSPAN) were also added to the record. The
result of this manipulation was a data set ready for statistical
analysis. The building of this data base is discussed in detail in
[Velardi 83].
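The clustering step can be pictured with a minimal sketch. It assumes, purely for illustration, that each compacted record carries a timestamp and a symptom key, and that two records belong to the same cluster when they share the key and lie no more than a fixed gap apart. The names `Cluster`, `cluster_records` and `max_gap` are hypothetical; the criteria actually used are those described in [Velardi 83] and may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Cluster:
    key: str          # hypothetical symptom key shared by the records
    first: datetime   # timestamp of the first record in the cluster
    last: datetime    # timestamp of the last record in the cluster
    points: int = 1   # number of records; plays the role of SWPOINTS/HWPOINTS

    @property
    def span_seconds(self) -> float:
        # Time between first and last record; plays the role of SWSPAN/HWSPAN.
        return (self.last - self.first).total_seconds()


def cluster_records(records, max_gap=timedelta(minutes=5)):
    """Group (timestamp, key) records into clusters.

    A record joins the open cluster for its key if it arrives within
    max_gap of that cluster's last record; otherwise a new cluster starts.
    The five-minute default is an assumption, not a value from the study.
    """
    open_clusters = {}
    finished = []
    for ts, key in sorted(records):
        current = open_clusters.get(key)
        if current is not None and ts - current.last <= max_gap:
            current.last = ts
            current.points += 1
        else:
            if current is not None:
                finished.append(current)
            open_clusters[key] = Cluster(key, ts, ts)
    finished.extend(open_clusters.values())
    return sorted(finished, key=lambda c: c.first)
```

With such a representation, a cluster of a single record has a span of zero seconds, which is one way a median time span of zero can arise even when the mean is large.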

3.1 PROCESSING THE ERROR DATA

The raw LOGREC data includes CPU, channel, and device errors for all
equipment in the installation. Initially the software records on the
two IBM 3081's were selected for this analysis. In each software record
there are a number of bits describing the type of error, its severity,
and the result of hardware and software attempts to recover from the
problem. The general software error status indicators provided by the
hardware and software are TYPE (of detection), EVENT (causing the
detection) and ERRCODE (code or symptom of the error).¹ For the purposes
of this study two additional data sets which contained information on
hardware/software interaction were also generated:

1. Software errors found related to temporary hardware errors
(HW/SW-Temp.).

2. Software failures found related to permanent hardware failures
(HW/SW-Perm.).

The HW/SW-Perm. data set was created by matching the software records
with the log (UNILOG) of all hardware failures manually maintained at
CIT. The matched records were then inspected to confirm that the
resulting data (nearly 70 failures) did indeed correspond to
hardware-related software failures. The HW/SW-Temp. data set was
obtained by matching the software errors with temporary channel and disk
problems. The data on channel problems came from the Channel Check
(CCH) records and from Missing Interruption Handling (MIH) records. The
data on disk errors came from the system outboard records (OBR). Again
the merged data set was carefully inspected to confirm that the records
corresponded reasonably well to hardware-related software errors. Table
1 provides brief descriptions of the sources of data employed in this
study (see [IBM 79] and [Butner 80] for a detailed description of these
records). A sample of the hardware-related software records is given in
Fig. 1. A summary of the data appears in Table 2. Frequency plots of
the data are given in Appendix B.
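The matching of software records against hardware records can be sketched as a time-window join. This is only an illustration under a stated assumption: the text does not give the matching criterion, so the sketch treats a software record as a candidate HW/SW record when at least one hardware record (CCH, MIH, OBR or UNILOG entry) falls within a hypothetical tolerance `window` of it. The function name `match_hw_sw` and the 60-second default are likewise hypothetical, and candidates produced this way would still need the manual inspection described above.

```python
from bisect import bisect_left, bisect_right


def match_hw_sw(sw_times, hw_times, window=60.0):
    """Pair each software record with the hardware records close to it in time.

    sw_times, hw_times: record times in seconds from a common origin.
    window: hypothetical tolerance in seconds (an assumption, not a value
    taken from the study).
    Returns a list of (sw_time, [hw_times within window]) for matched records.
    """
    hw_sorted = sorted(hw_times)
    matches = []
    for sw in sorted(sw_times):
        lo = bisect_left(hw_sorted, sw - window)
        hi = bisect_right(hw_sorted, sw + window)
        nearby = hw_sorted[lo:hi]
        if nearby:
            matches.append((sw, nearby))
    return matches
```

Sorting the hardware times once and using binary search keeps the join cheap even when the hardware log (several thousand CCH and OBR records) is much larger than the software log.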

¹ The IBM names for these fields are [IBM 79]: TYPE - HORTYP; EVENT -
SOUERRA; ERRCODE - SOUACMPC.
TABLE 1

Sources of data

Type of Record               Explanation

Channel Check Record         These records are generated for every
(CCH)                        channel error (includes Channel Control
                             Checks, Channel Data Checks and Interface
                             Control Checks). CCH's are temporary
                             hardware errors and do not result in
                             system termination.

Missing Interruption         MIH records are due to missing or pending
Handling (MIH)               device and channel end interruptions.

Out Board Records (OBR)      OBR records are generated for a wide range
                             of events (normal and abnormal). The
                             category used in this analysis is
                             temporary and permanent device errors.

Software Records             Software records are generated for
                             selected software events. Examples are
                             invalid SVC, program checks, system abends
                             or user abends which request a recording.

UNILOG                       UNILOG is an installation maintained log
                             of all software and hardware component and
                             system failures.
[Figure 1: Sample of hardware/software errors]

It can be seen from the data in Fig. 1 that it is not unusual to have
more than one software record for a permanent hardware problem (i.e.,
HW/SW-permanent). Observations 2-4 and 12-13 are some examples. For
temporary hardware problems, note that not only are some of the
observations very close in time, they also refer to different hardware
or software problems. For example, observations 4 and 5 indicate that
two software errors occurred in connection with a channel check and a
temporary disk error on different programs. The closeness in time of
those errors suggests that the cause of these problems was common. It
is clear that the system was not able to diagnose and relate these
records (e.g., two SW records, one CCH and one OBR, for the temporary
hardware problem). A detailed analysis of the data (both HW/SW-perm.
and HW/SW-temp.) confirmed that the system is seldom able to diagnose a
hardware related software error.

TABLE 2

Summary of the data

Period of Study: March 1982 - May 1983

Data Set                                     Source                 Freq.

All SW Errors                                SW Records             1547
All Permanent HW Failures                    UNILOG                  264
All Temporary HW Errors                      CCH, OBR               4461
SW Errors Related to Temporary HW Errors     SW Records/CCH, OBR     108
SW Errors Related to Permanent HW Failures   SW Records/UNILOG        69

The next section investigates the detection of software errors.
Particular attention is paid to the detection of software errors related
to temporary and permanent hardware problems.

4. ANALYSIS OF ERROR DETECTION

This section investigates the detection of software errors in MVS.
In particular, the following points are considered:

1. The relationship between the type of software problem and the
type of detection (i.e., hardware or software).

2. The impact of hardware or software detection on system recovery.

3. The detection of software errors found to be hardware-related.
3. Storage management: indicates an error in the storage
allocation/de-allocation process or in virtual memory mapping.

4. Storage exceptions: indicates addressing of non-existent or
inaccessible memory locations.

5. Programming exceptions: indicates a program error other than a
storage exception.

6. Deadlocks: indicates a system or operator detected endless loop,
endless wait state or violation of system or user defined time
limits.

7. Lost records: indicates that the error recording process was
itself affected by an error.

4.2 ERROR DETECTION STATISTICS

There are significant differences in the error distributions between the
two detection mechanisms. Table 3 gives the percentage distribution of
the errors during the analysed period. On average, the two major error
categories are storage exceptions (25%) and storage management (26%).

It can be seen that all exception type problems are detected by hardware
and that storage management type problems are almost exclusively
detected by software (395 of 406). In the case of control and I/O
problems, it is found that almost twice as many are software-detected.
An analysis of the hardware-detected control and I/O problems showed
that these were in fact forced program checks and were detected as a
result of specific software traps. Note from Table 3 that storage
related problems dominate both hardware and software-detected errors.
Recall that a major feature of the MVS operating system is the multiple
virtual storage organisation. Storage management is a high volume
activity and is critical to the proper operation of the system. One
might therefore expect its contribution to errors to be significant.

TABLE 3

Distribution of error categories

                           Hardware Detected   Software Detected    All
Error type                 Freq.      %        Freq.      %          %

Storage management           11      1.9        395     44.2       26.2
Storage exceptions          382     67.0          0      0.0       24.7
Deadlocks                     0      0.0        310     34.6       20.2
I/O and data management      45      7.9        116     13.0       10.5
Programming exceptions      114     19.9          0      0.0        7.4
Control                      18      3.2         50      5.6        4.4
Invalid                       1      0.1         23      2.6        6.6
ALL                         571    100.0        894    100.0      100.0
4.3 ERROR DETECTION AND RECOVERY

In MVS the system can recover from an error by a retry or by aborting
the job or task (a module of the job) in progress [IBM 80]. If the job
or task is critical for system continuation, abortion will cause system
failure. Table 4 provides information on how an error was handled. The
table shows that a hardware-detected error is more likely to result in a
system failure (29.9% against 17.7%) and less likely to be retried
successfully (24.0% against 35.6%) than a software-detected error.


TABLE 4

Effectiveness of the recovery

Detection    Freq.   JOBTERM   TASKTERM   RETRY   FAILURE
                        %         %         %        %

Hardware      571      0.9      45.2      24.0     29.9
Software      894     20.0      26.6      35.6     17.7
All*         1547     13.0      33.5      31.1     22.4

* This includes Lost Records and Operator detected errors also.

Recovery routines are specified in MVS for major system functions
[Auslander 82]. Table 5 relates the provision of recovery routines to
the detection mechanisms. We find that recovery routines are specified
for almost twice as many software-detected problems as for
hardware-detected ones. The table shows that software-detected problems
are better handled (a higher chance of recovery than for
hardware-detected problems). In both cases, however, we find that the
availability of a recovery routine substantially improves the recovery
probability. An important reason for the better performance on
software-detected problems is that software detects most (or all)
storage management type problems. Since storage management is an
important function of MVS, it is more carefully designed and better
protected by recovery routines. Also, the system has more information
available regarding a software-detected problem than one detected by the
hardware.

TABLE 5

Effect of recovery routines

Error Type   Recvy Routine   Failures          Failures
             Provided        (Rcvy Routine     (Rcvy Routine
                             Provided)         Not Provided)
                 %               %                 %

Hardware        43.5            27.6              31.6
Software        84.8            13.3              42.1
All             66.1            16.8              34.8

4.4 DETECTION OF HARDWARE-RELATED SOFTWARE ERRORS

Our previous analysis [Velardi 83] appeared to indicate that the
error detection mechanism on MVS is not always able to diagnose software
problems resulting from a hardware failure. Recall that the error data
set used in this study contains information on software errors and
failures found to be related to both temporary and permanent hardware
problems.
TABLE 6

HW/SW errors - detection

               HW/SW-Temporary    HW/SW-Permanent    All SW*
Detection      Freq.      %       Freq.      %       Freq.      %

Hardware         23     21.3        27     39.1       521     38.0
Software         64     59.3        30     43.5       800     58.5
Lost record      21     19.4        10     14.5        46      3.4
Operator          0      0.0         2      2.9         1      0.1
Total           108    100.0        69    100.0      1368    100.0

* Note: This does not include hardware-related problems

Table 6 analyses the detection of software errors found to be
hardware-related. It can be seen from Table 6 that lost records are a
significant proportion of hardware-related software errors. Note also
the fact that more than 40 percent of all lost records occur in
combination with a HW/SW error (whereas HW/SW errors are only 11.0
percent of all software errors). This seems to show that software error
data collection itself is affected by the occurrence of a hardware
error. Further investigation of this problem revealed that the job name
of the hardware record associated with the software error tagged "LOST"
generally indicated a system critical job. In addition, lost records
commonly appear in very large clusters, indicating the persistency of a
problem, and usually result in system termination. It appears from the
data that such an occurrence can almost always be considered as a
symptom of a hardware-related software problem.
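The lost-record figure quoted above can be checked directly from the frequencies in Table 6. The short calculation below is only a worked restatement of that arithmetic; the variable names are illustrative and no data beyond the table is assumed.

```python
# Lost-record frequencies and column totals copied from Table 6.
lost = {"hw_sw_temp": 21, "hw_sw_perm": 10, "other_sw": 46}
totals = {"hw_sw_temp": 108, "hw_sw_perm": 69, "other_sw": 1368}

lost_hw_related = lost["hw_sw_temp"] + lost["hw_sw_perm"]
hw_sw_total = totals["hw_sw_temp"] + totals["hw_sw_perm"]

# Share of all lost records that occur together with a HW/SW error.
share_of_lost = lost_hw_related / sum(lost.values())

# Rate of lost records among HW/SW errors and among the remaining errors.
lost_rate_hw_sw = lost_hw_related / hw_sw_total
lost_rate_other = lost["other_sw"] / totals["other_sw"]

print(f"lost records tied to HW/SW errors: {share_of_lost:.1%}")
print(f"lost-record rate, HW/SW errors:    {lost_rate_hw_sw:.1%}")
print(f"lost-record rate, other SW errors: {lost_rate_other:.1%}")
```

On these counts, 31 of the 77 lost records (about 40.3 percent) accompany a HW/SW error, and a HW/SW error is roughly five times as likely to be recorded as lost as any other software error (about 17.5 percent against 3.4 percent).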

5. ANALYSIS OF HARDWARE RELATED SOFTWARE PROBLEMS

This section analyses temporary and permanent hardware-related software
problems. Significant features of hardware-related software errors are
determined and their effect on recovery management is examined.


TABLE 7

Device involvement statistics

               HW/SW-Temporary      HW/SW-Permanent      All HW/SW
Device         Freq.   % (All SW    Freq.   % (All SW    % (All SW
                       Errors)              Failures)    Errors)

CPU/Channel      76      4.9          20       7.0          6.1
Disk             32      2.1          42      14.6          4.8
Other             0      0.0           7       2.4          0.1
Total           108                   69      24.0         11.0

Table 7 shows the frequency and percentage of hardware devices involved
in software errors. Disks or channels are almost always involved. CPU
and channel are considered together because the IBM 3081 contains both
in one box and a channel problem usually also affects the CPU. The
table also shows that about 11 percent of software errors are found
related to a hardware problem. About 7 percent of all software errors
were related to temporary hardware problems. Nearly 25 percent of all
software failures, however, were related to permanent hardware problems.
The figure for permanent hardware failures is somewhat higher than the
result on VM/370 reported in [Rossetti 82]. That study found that 16
percent of all software failures were hardware-related.

Table 8 provides statistics on hardware related software errors, i.e.,
Time Between Errors, the number of records (SWPOINTS) in a cluster (i.e.,
referring to the same problem) and the time span (SWSPAN) of the error
(time between the first and the last record in a cluster). It is noted
that both HW/SW-Temp. and HW/SW-Perm. have larger clusters and larger
error handling times (i.e., SWSPAN) in comparison with all SW errors.
The permanent failures have the larger times of the two. It was also
observed that several of the large clusters had many jobs involved.

Summarising, we find that the key features of hardware-related software
problems are that they are very likely to result in lost records,
occur in large clusters and involve many jobs.
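The three statistics reported for each quantity (mean, standard deviation, median) can be computed as in the following sketch. It assumes, hypothetically, that each error is represented by the start time of its cluster in hours, and it uses the sample standard deviation; the text does not say whether the sample or population form was used. The function names are illustrative.

```python
from statistics import mean, median, stdev


def time_between_errors(start_hours):
    """Gaps, in hours, between consecutive errors given their start times."""
    ordered = sorted(start_hours)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def summarise(values):
    """Mean, sample standard deviation and median of a list of measurements."""
    return {
        "mean": mean(values),
        "stdev": stdev(values),  # sample form; an assumption, see lead-in
        "median": median(values),
    }


# Hypothetical start times (hours), for illustration only.
gaps = time_between_errors([0.0, 2.0, 3.0, 10.0])
summary = summarise(gaps)
```

A median well below the mean, as in this toy example, is the signature of a skewed distribution in which a few long gaps or large clusters dominate the average.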
TABLE 8

Statistics on hardware/software interaction

TIME BETWEEN ERRORS (Hours)

                      HW/SW-Temp.   All SW    HW/SW-Perm.   All SW
                                    Errors                  Failures

Mean                    100.9         7.9       159.4         44.8
Standard deviation      208.3        12.8       304.8        108.8
Median                   26.2         2.5        43.5          6.3

SWSPAN (Seconds)

                      HW/SW-Temp.   All SW    HW/SW-Perm.   All SW
                                    Errors                  Failures

Mean                     91.0        49.5       205.4         47.9
Standard deviation      312.4       203.8       958.6        183.7
Median                    0.0         0.0         0.0          0.0

SWPOINTS

                      HW/SW-Temp.   All SW    HW/SW-Perm.   All SW
                                    Errors                  Failures

Mean                     11.5         4.2        16.9          4.8
Standard deviation       33.1        15.7        60.3         23.3
Median                    2.0         2.0         4.0          2.0

5.1 RECOVERY OF HARDWARE-RELATED SOFTWARE ERRORS

This section analyses the recovery management of temporary and permanent
hardware-related software problems. Recall that in handling a software
problem the system can recover by issuing a retry, or by aborting the
current job or task (a module of the job) in progress. If the job
involved is critical for system continuation, system failure will
result.


TABLE 9

Specification of recovery routines for HW/SW errors

Error Type   Recvy Routine   Failures          Failures
             Provided        (Rcvy Routine     (Rcvy Routine
                             Provided)         Not Provided)
                 %               %                 %

Temporary       62.9            20.6              84.2
Permanent       46.4           100.0             100.0
All             56.5            46.0              92.0

Recovery routines are specified in MVS for major system functions.
Table 9 shows that software errors related to permanent hardware
failures have a lower probability of having recovery routines specified
than software errors related to temporary hardware errors or normal
software errors. The figure is almost a third lower. Comparing Tables 9
and 5, it is also clear that, although recovery routines are specified
for almost the same proportion of HW/SW-temporary errors as for all
software errors, they are not nearly as effective. In addition, the
percentage of failures when recovery routines are not provided is
substantially higher. Thus, the system recovery management is
significantly less effective in handling a HW/SW error than it is in
dealing with a software problem in general. This is significant since
it points to a particularly weak aspect of the system. It may be argued
that a better provision of recovery routines specifically geared toward
the hardware-software interaction could considerably alleviate the
problem.
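The comparison between Tables 5 and 9 can be restated numerically. The sketch below only re-derives ratios from the percentages printed in those two tables; the dictionary keys and the helper `relative_drop` are illustrative names, not part of the original analysis.

```python
# Percentages copied from Table 5 (row "All") and Table 9.
routine_provided = {"all_sw": 66.1, "hw_sw_temp": 62.9, "hw_sw_perm": 46.4}
failures_with_routine = {"all_sw": 16.8, "hw_sw_temp": 20.6, "hw_sw_perm": 100.0}
failures_without_routine = {"all_sw": 34.8, "hw_sw_temp": 84.2, "hw_sw_perm": 100.0}


def relative_drop(base, value):
    """Fraction by which value falls below base."""
    return (base - value) / base


# How much less often recovery routines are specified, relative to all SW errors.
perm_drop = relative_drop(routine_provided["all_sw"], routine_provided["hw_sw_perm"])
temp_drop = relative_drop(routine_provided["all_sw"], routine_provided["hw_sw_temp"])

# How much more often HW/SW-temporary errors end in failure than SW errors overall.
fail_ratio_with = failures_with_routine["hw_sw_temp"] / failures_with_routine["all_sw"]
fail_ratio_without = (
    failures_without_routine["hw_sw_temp"] / failures_without_routine["all_sw"]
)
```

On these figures, recovery routines are specified about 30 percent less often for HW/SW-permanent errors than for software errors overall (the "almost a third lower" above) and only about 5 percent less often for HW/SW-temporary errors, yet the failure percentage for HW/SW-temporary errors is roughly 1.2 times higher when a routine is provided and 2.4 times higher when it is not.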


TABLE 10

HW/SW-Temporary: Recovery management

                   TOTAL            CCH      MIH      DISK
Error type      Freq.     %          %        %        %

Retry             25     23.2      20.5     18.9     31.3
Task Term.        16     14.8      10.3     13.5     21.8
Job Term.         19     17.6       7.7     43.2      0.0
Failure           25     23.2      18.0     16.2     37.5
Lost Records      23     21.3      43.6      8.1      9.4
All              108    100.0      36.1     34.3     29.6

Tables  10  and  Table  11  provide  information  on  recovery  from  HU/SW-Tempo- 
rary  errors.  It  can  be  seen  from  the  table  10  that  MIH  (Missing  Inter¬ 
ruption  Handling)  causes  the  highest  job  and  task  terminations  and  sytem 
damage.  These  are  seen  from  table  11  to  be  most  closely  related  to 


22 


TABLE 11

HW/SW-Temporary: Error types

                              TOTAL          CCH    DISK     MIH
Error type                 Freq.      %        %       %       %

Control                        4    3.7     0.0     3.1     8.1
Deadlocks                     29   26.9    15.4     0.0    62.2
I/O and data management        7    6.5     2.6     9.4     8.1
Storage management            23   21.3    18.0    31.3    16.2
Storage exceptions            12   11.1    15.4    18.8     0.0
Programming exceptions         8    7.4     0.0    21.9     2.7
Unclassified                  25   23.2    48.7    15.6     2.7

All                          108  100.0    36.1    29.6    34.3

deadlocks. This is quite reasonable since MIH errors are due to
interrupts which are not completed in a specified time. The deadlocks
are most commonly due to the detection of a wait state or an endless
loop. Retries are also the lowest for MIH since most of them are
deadlocks. More than 40% of the channel related software errors result
in a lost record. We find that in most of these cases both the hardware
and the software problem have large clusters associated with them. The
disk and channel errors most commonly manifest themselves as storage
problems or exceptions. This could also imply that the real problem was
not in the channel but perhaps in main storage, which resulted in both
the channel error and the software record.
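
As a rough cross-check of the figures quoted above, the percentages in
Table 10 can be converted back into approximate record counts. The
sketch below is an illustration only. It assumes that the CCH, MIH and
DISK entries in the "All" row are shares of the 108 HW/SW-temporary
records and that the remaining entries are percentages within each
column. The counts are inferred from rounded percentages and are not
reported in the data; the function and variable names are hypothetical.

```python
# Shares of the 108 HW/SW-temporary records, from the "All" row of Table 10.
TOTAL_RECORDS = 108
COLUMN_SHARE = {"CCH": 36.1, "MIH": 34.3, "DISK": 29.6}


def estimated_count(total, percent):
    """Return the nearest whole number of records for a percentage of total."""
    return round(total * percent / 100.0)


# Approximate number of records in each hardware category (assumed, not reported).
column_counts = {
    name: estimated_count(TOTAL_RECORDS, share)
    for name, share in COLUMN_SHARE.items()
}

# Selected cells of Table 10 converted to approximate counts.
cch_lost_records = estimated_count(column_counts["CCH"], 43.6)
mih_job_terminations = estimated_count(column_counts["MIH"], 43.2)
disk_failures = estimated_count(column_counts["DISK"], 37.5)

print(column_counts)
print(cch_lost_records, mih_job_terminations, disk_failures)
```

Under these assumptions the three category counts add up to the 108
records in the table, which suggests that the column percentages are
internally consistent.
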

It is significant to note that 23% of HW/SW-temporary errors result
in system failure. Taking this and HW/SW-permanent failures into
account, it was found that nearly 35 percent of all software failures
are hardware related. In addition, it was found that most of the lost
records also resulted in system termination. Thus the true percentage
of software failures (in our data) which are hardware related is over
40 percent.

In summary, the analysis shows that recovery management of HW/SW
errors is significantly less effective than that of software errors in
general. In many of these cases it was felt that the system could have
been designed to continue in a degraded mode. At the least, the software
should be capable of recognising a hardware failure and taking the
offending component off-line or putting the system in a wait state (see
footnote 2).

Software problems related to temporary hardware errors are not well
managed either. The system has a low fault tolerance for these errors.
Over 40 percent of all software failures are hardware related. It is
believed that an important reason for this is the inadequate
communication between the hardware and software regarding the occurrence
of errors. If a hardware error were diagnosed and tagged as a potential
software error, it is possible that better recovery could be designed.
This would be especially true if the system were geared to recognise
certain patterns in these errors (such as those observed here) and
classify them as potential software problems. More data analysis and
experimentation are necessary before this can be achieved in a reliable
manner.


2 Although this capability does exist in MVS for handling some hardware
problems, e.g. channel errors, there is no specific provision for
handling HW/SW errors in general.

6. CONCLUSIONS

It has been the purpose of this paper to analyse the interaction between
hardware and software as it relates to system reliability. It was seen
that a hardware-detected error is more likely to result in a system
failure than a software-detected problem. An important reason for the
better performance on software-detected problems is that software
detects most (or all) management type problems. Management is an
important function of MVS and is hence more carefully designed and
better protected by recovery routines.

Statistics on HW/SW errors show that about 11 percent of software
errors are hardware related. About 7 percent of software errors were
related to temporary hardware problems; more than 24 percent of all
software failures, however, were related to permanent hardware problems
(see footnote 3). Taking all hardware errors into account (HW/SW-Temp.
and HW/SW-Perm.), over 40 percent of software failures were determined
to be hardware related.

Importantly, the analysis indicates that there is poor communication
between the facilities detecting hardware and software problems. An
analysis of the data clearly shows that the system is not able to
diagnose the fact that a software error may be hardware related. The key
features of HW/SW errors identified in our data were:

1. Both hardware and software errors occur in large clusters.

2. The HW/SW errors have a significant percentage of lost records.

3. The SW record in a HW/SW error may have many jobs involved.

4. The system recovery management is less likely to recover from a
HW/SW error than from a software error in general.


3 In comparison, [Rossetti 82] found that 16 percent of software failures
on VM/370 were hardware related.

It is suggested that some of the error patterns found in this study
could form the basis for detection of hardware related software errors.
It is of course possible that both the hardware error and the software
error indicate no more than a symptom of the real problem. There is
some evidence in our data to suggest that this is a possible scenario.
However, if the detection were better coordinated, it is possible that
at least system termination due to temporary hardware problems could be
reduced. Better communication between the hardware and software error
detection mechanisms may be an area where further effort toward
alleviating this problem can be directed. There is no doubt that more
data analysis and experimentation are necessary before the patterns
found in this study can be used as a basis for a suitable detection
policy.

Appendix A


MVS ERROR DETECTION AND RECOVERY PROCESSING

A.1 ERROR DETECTION

The supervisor in MVS offers many services to detect and process
abnormal conditions during system execution.

1. The hardware detects conditions such as memory violations, program
errors (arithmetic exceptions, invalid operation codes),
addressing errors and password checking on critical system
resources.

2. The software also provides detection of software problems.

The data management and supervisor routines ensure that valid
data are processed and non-conflicting requests are made. Examples
are the incorrect specification of a parameter in a control
structure or in a system macro, or a supervisor call issued by an
unauthorized program.

The installation might improve the system error detection
capability by means of a software facility called Resource Access
Control Facility (RACF). RACF is used to build detailed
'profiles' of system software modules. These profiles are
defined in order to inspect the correct usage of system
resources.

The user can also employ other software facilities to detect
the occurrences of selected events. "Appendages" are routines
that enable the user to get control during different phases of an
I/O operation. The "Serviceability Level Indication Processing"
(SLIP) facility also aids in error detection and diagnosis. The SLIP
command allows the user to set traps that cause a program interruption
when particular events are intercepted. The user might also
define his own detection mechanisms by means of the "Set Program
Interruption Element" (SPIE) macro. This macro instruction
detects programmer defined exceptions like using an incorrect
address or attempting to execute privileged instructions. Using
these facilities, user defined error conditions can be detected
in addition to system provided program checks.

3.  The  operator  might  detect  some  evident  error  condition  and  decide 
to  cancel  or  restart  the  job.  For  example,  the  operator  can 
detect  loop  conditions  or  endless  wait  states. 

A.2 RECOVERY PROCESSING

Whenever a program is abnormally interrupted due to the detection of an
error, the Supervisor gets control. If the problem is such that further
processing could degrade the system or destroy data, the Supervisor
gives control to the Recovery Termination Manager (RTM). If a recovery
routine is available for the problem program, RTM gives control to this
routine before processing the program termination.

Recovery is designed as a means by which the system can prevent total
loss. The purpose of a recovery routine is to free the resources held by
the failing program (if any), to locate the error and to request either
a continuation of the termination process or a retry. Recovery
routines are generally provided to cover all MVS functions [Auslander
81]. It is, however, the responsibility of the installation or of the
user to write recovery routines for other programs.

More  than  one  recovery  routine  can  be  specified  for  the  same  program; 
if  the  latest  recovery  routine  asks  for  a  termination  of  the  program, 
the  RTM  can  give  control  to  another  recovery  routine  (if  provided). 
This  process  is  called  'percolation'. 

The percolation process ends if either a routine issues a valid retry
request, or no more routines are available. In the latter case, the
program and its related subtasks are terminated. The termination of a
program might imply the termination of the jobstep. If a valid retry is
requested, a retry routine restores a valid status, using the
information supplied by the recovery routine(s), and can give control to
the program. In order for a retry to be valid the system should verify
that there is no risk of recurrence of the error to the same recovery
routine, and that the retry address is properly specified. Figure 2
illustrates the steps in the recovery process.
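
The percolation logic described above can be sketched as follows. This
is a simplified model and not MVS code: the function names, the ordering
of the routine list and the validity check (reduced here to the presence
of a retry address) are assumptions made purely for illustration.

```python
RETRY = "RETRY"        # the recovery routine asks for a retry
CONTTERM = "CONTTERM"  # the recovery routine asks to continue with termination


def percolate(routines, error):
    """Model percolation through the recovery routines of one program.

    `routines` is ordered from the earliest to the latest one specified.
    Each routine takes the error and returns (result, retry_address).
    Returns (outcome, trace); outcome is "RETRY" or "TERMINATED".
    """
    trace = []
    for routine in reversed(routines):  # the latest routine gets control first
        result, retry_address = routine(error)
        trace.append((routine.__name__, result))
        # Assumed validity check: a retry only counts with a retry address.
        if result == RETRY and retry_address is not None:
            return "RETRY", trace
    # No routine issued a valid retry request: the program is terminated.
    return "TERMINATED", trace


def task_level_routine(error):
    """Hypothetical routine that cannot recover and asks for termination."""
    return CONTTERM, None


def job_level_routine(error):
    """Hypothetical routine that requests a retry at a made-up address."""
    return RETRY, 0x1000


outcome, trace = percolate([job_level_routine, task_level_routine], "program check")
print(outcome, trace)
```

In this example the latest routine asks to continue with termination, so
control percolates to the earlier routine, whose retry request ends the
process.
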

Figure 2: Software handling of software errors on MVS


A.3 ERROR RECORDING ON SYS1.LOGREC

Before a recovery routine takes control, the RTM initialises a work
area called the System Diagnostic Work Area (SDWA). This area is used by
the RTM to communicate with the recovery routines and to log information
regarding the error. Thus at the end of the recovery process the SDWA
contains a history of the incident and the associated recovery process.
At the end of the recovery process the RTM invokes the error recording
routines to generate a record of the incident. The data set containing
this information is called SYS1.LOGREC.

A software record also contains the information about the event
(EVENT) that caused the record to be generated, and a 12 bit error
symptom code (ERRCODE) describing the reason for the program abnormal
termination. These codes are issued by the system or by the problem
program that used an ABEND macro instruction. The system and user
completion codes appear together in the ERRCODE field. User codes are
meaningful only for specific applications.

Table 12 describes the values assumed by the variable EVENT. Table
13 gives some examples of common system ERRCODEs encountered in this
study. The detection mechanism and the action taken by the system are
also described. More than 500 different ERRCODEs are issued by the
system for a problem program.

Traces of the recovery process are recorded on LOGREC. These include
the name and the type of the recovery routine which handled the problem
(RECNAME), the result (RESULT) of the recovery process and the impact
of the error on the related jobstep (JOBTERM). A description of these
fields is given in Table 14. Other data collected during the recovery
process include detailed program status information such as the
contents of registers and the program address space identifier. This can
be helpful in error diagnosis.

During the recovery process the system basically attempts to maintain
operation despite an error. It is possible that the recovery process
itself encounters the same error. In this case, there exists the risk
of recursive recovery processes, or the generation of bad data.
However, such occurrences can be detected by analyzing the SDWA fields
recorded in LOGREC. If the jobname, for example, is 'NONE-FRR', this
indicates that the record is generated by a functional recovery routine
during a recovery attempt. Finally, if the recording process was also
affected by an error, a LOSTREC value appears in the TYPE field.
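
The two checks just described can be sketched as a small filter over
software records. This is an illustration under stated assumptions: a
record is modelled as a plain dictionary, and the key name JOBNAME is
hypothetical. Only the values 'NONE-FRR' and LOSTREC and the TYPE field
are taken from the description above.

```python
def recording_flags(record):
    """Return notes on how a LOGREC software record was generated.

    `record` is a dictionary of field values; the key "JOBNAME" is an
    assumed name for the jobname field.
    """
    flags = []
    # A jobname of 'NONE-FRR' marks a record written by a functional
    # recovery routine during a recovery attempt.
    if record.get("JOBNAME", "").strip() == "NONE-FRR":
        flags.append("generated during a recovery attempt")
    # LOSTREC in the TYPE field marks a recording process hit by an error.
    if record.get("TYPE") == "LOSTREC":
        flags.append("recording affected by an error")
    return flags


print(recording_flags({"JOBNAME": "NONE-FRR ", "TYPE": "SOFTWARE"}))
print(recording_flags({"JOBNAME": "PAYROLL", "TYPE": "LOSTREC"}))
```
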

TABLE 12

Event that caused program termination

Variable EVENT

Values       Meaning

MACHECK      A hardware event caused a machine check that
             could not handle the problem

PROGCHECK    A program check interrupt occurred due to the
             detection of some exception or to the violation
             of some memory protection mechanism

TRSFAIL      A translation error, e.g., an error occurred
             during the storage allocation process

RESTART      The operator pressed the restart key

ROUTABT      A system service routine detected an invalid SVC
             and issued an abnormal termination of the
             program (ABEND)

ROUTSVC      A system routine issued an invalid supervisor
             call (SVC)

PROGABT      The program itself requested the ABEND

SYSABT       The system detected a problem and forced a
             program ABEND

TABLE 13

Examples of ABEND reason codes

Hex code      Explanation                   System action

05A           A service routine that        The program that called
              handles real storage          the service routine or
              deallocation received         the routine abnormally
              an invalid address            terminates

(illegible)   The operator determined       The operator pressed the
              that the program was in       RESTART key
              a loop or endless wait
              state

(illegible)   Operation exception: an       A program interruption
              operation code is not         occurred; the task is
              assigned                      terminated if no routine
                                            had been specified to
                                            handle the interruption

(illegible)   The error occurred during     The task is terminated
              the creation of a data set    if no routine has been
              due to the incorrect          specified for the
              specification of some data    problem program
              parameter

TABLE 14

Recovery information

Variable name   Values             Meaning

RECNAME         8 character name   Name of the recovery
                                   routine which handled
                                   the problem

RESULT          RETRY              The recovery routine
                                   decided that a retry
                                   might be successful

                CONTTERM           The recovery routine
                                   asks to continue with
                                   termination (this might
                                   imply percolation)

JOBTERM         YES/NO             If JOBTERM=YES the entire
                                   jobstep has to be
                                   terminated

An error may have four possible effects:

1. RETRY: The system successfully recovered and returned control to
the problem program.

2. TASK TERMINATION: The program and its related subtasks are
terminated, but the system is not affected.

3. JOB TERMINATION: The job in control at the time of the error is
aborted.

4. SYSTEM DAMAGE: The job or task in control at the time of the
error was critical for system continuation. Thus job/task
termination resulted in system failure.
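
One way to relate the four effects above to the recovery fields of
Table 14 is sketched below. The mapping is an assumption made for
illustration and is not stated in the data: RESULT=RETRY is taken as a
retry, JOBTERM=YES as a job termination and any other case as a task
termination. System damage cannot be derived from these two fields, so
it is passed in as a separate, hypothetical flag.

```python
def classify_effect(result, jobterm, system_failed=False):
    """Map recovery fields to one of the four effects (assumed mapping).

    result        -- "RETRY" or "CONTTERM" (RESULT field)
    jobterm       -- "YES" or "NO" (JOBTERM field)
    system_failed -- hypothetical flag: the terminated job or task was
                     critical for system continuation
    """
    if system_failed:
        return "SYSTEM DAMAGE"
    if result == "RETRY":
        return "RETRY"
    if jobterm == "YES":
        return "JOB TERMINATION"
    return "TASK TERMINATION"


print(classify_effect("RETRY", "NO"))
print(classify_effect("CONTTERM", "NO"))
print(classify_effect("CONTTERM", "YES"))
print(classify_effect("CONTTERM", "YES", system_failed=True))
```
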


Appendix  B 


Figure  3:  Hour  of  day  plot  of  software  errors 


Figure 4: Hour of day plot HW/SW Temporary errors


Figure  5:  Hour  of  day  plot  HW/SW  Permanent  errors 